Abstract — In wireless sensor networks (WSNs), numerous sensors can
produce a significant portion of big data. How to gather and transmit
such a large amount of data in a timely manner while minimizing data
latency remains an open issue. Moreover, spatially correlated sensor
observations lead to considerable data redundancy in the network. To
efficiently eliminate data redundancy and improve energy efficiency, we
exploit the fact that the more similar the measured data are, the
smaller the amount of data after aggregation is. We first develop a new
distributed clustering algorithm that groups sensor nodes with high
similarity into a cluster for data aggregation while ensuring uniform
energy consumption within the cluster. We then propose a data
aggregation algorithm based on principal component analysis (PCA) that
can be executed in the cluster head (CH). Finally, our experimental
results demonstrate that the amount of data transmission can be
significantly reduced by the proposed clustering and data aggregation
algorithms.

Index  Terms — Data  aggregation,  big  data,  wireless  sensor 
networks,  principal  component  analysis. 

I.  Introduction 

In wireless sensor networks (WSNs), numerous sensors can produce a
significant portion of big data when they are used extensively in
military and civilian applications such as target tracking, environment
monitoring, health monitoring, and observing phenomena [1]. Big data
typically comprise high-volume, high-velocity, and high-variety
information assets, which are difficult to collect, process, and
transmit using existing algorithms and models. In such applications, if
the sink node needs real-time information about the sensing region,
data transmission with low latency is crucial [2]. However, how to
transmit such a large amount of data while minimizing data latency
remains a challenging issue in wireless sensor networks [3].

A number of compression methods, such as compressed sensing (CS) [4],
principal component analysis (PCA), and wavelet compression [5], have
been proposed. Wavelet compression is a powerful tool for
non-stationary signals. Instead of sending raw sensing data, a
wavelet-based scheme transmits the wavelet coefficients to the sink
node, which then recovers the full raw data set by an inverse
transform. In practice, a wavelet basis must be chosen from a variety
of candidates. However, once the wavelet basis is selected, its
characteristics are fixed, and it is difficult to accurately
approximate the local signal characteristics at different scales.
Compressed sensing (CS) is a more complex compression method. CS
projects high-dimensional data into a low-dimensional space to reduce
the amount of transmitted data, and it can achieve precise recovery
with fewer measurements than the dimension of the raw data. The
complexity of the CS encoding process is very low, so CS is often used
in data aggregation [6]. The challenge of CS lies in decoding: its high
complexity not only requires high computing capacity but also causes
longer processing delay. In large-scale WSNs, these requirements
greatly restrict the application of CS. In contrast, principal
component analysis (PCA) can be executed in the cluster head (CH),
which can process the data quickly and aggregate them effectively with
little energy and delay. Thus PCA is a promising solution for
large-scale WSNs.

On the other hand, the sensing regions of sensor nodes in WSNs usually
overlap and are spatially dependent, so the observed data exhibit a
certain spatial correlation [7] [8]. Spatially correlated sensor
observations therefore lead to considerable data redundancy in the
network. To reduce data latency and increase energy efficiency in data
transmission, it is highly desirable to eliminate such redundancy
through effective data aggregation. To this end, researchers have
proposed various data aggregation approaches. LEACH is a typical
clustering protocol [9]. It saves energy by changing the structure of
the network; however, its cluster head nodes are pre-selected, and the
CHs are fixed until the end of the life cycle of the network. In [10],
Wang et al. designed the single-hop-length (SHL) and
multiple-hop-length (MHL) schemes for optimal aggregation throughput
and considered the tradeoff between aggregation throughput and
gathering efficiency. In [11], Hua et al. presented an optimal routing
and data aggregation scheme that exploits the special structure of the
sensor network. In [12], Barton et al. showed that a data aggregation
rate of $O(\log n / n)$ per node is optimal. Liu et al. [13] explored
temporal correlation in each cluster, but they send only part of the
sensed data to the sink node, which cannot accurately represent the
specific information in the network. In [14], the authors proposed a
clustering approximation framework built on a grid-based spatial
correlation clustering method, which clusters the sensor nodes
according to data correlation. It reduces the amount of transmitted
data at the cost of data accuracy. However, the aforementioned
algorithms ignore the characteristics of the data themselves.

In this paper, to efficiently transmit the sensed big data with low
latency, we propose a principal component analysis (PCA) based data
aggregation algorithm that eliminates data redundancy in cluster head
nodes so as to minimize the amount of data transmitted. Clearly, for a
given normalized reconstruction error in PCA, the more similar the
sensed data of the nodes are, the smaller the amount of data after
aggregation is. Based on this observation, we give a definition of data
similarity suitable for principal component analysis. Compared with
existing aggregation algorithms, which focus on the locations of
sensors, our algorithm pays more attention to the similarity of the
data themselves.

The contributions of this paper can be summarized as follows.

• First, in order to balance the energy consumption among clusters and
avoid data conflicts within a cluster, we find the optimal number of
members in each cluster.

• Second, we propose a distributed clustering algorithm based on data
similarity that puts nodes with high similarity into the same cluster.

• Third, on the basis of the constructed clusters, we propose a
PCA-based data aggregation algorithm that runs on the cluster head
nodes to process the similar data from the cluster members.

• Finally, we verify by simulation that our algorithm can efficiently
minimize the amount of data transmission on cluster heads while
ensuring low latency of data transmission.

The  rest  of  this  paper  is  organized  as  follows.  Section 
II  analyzes  the  optimal  number  of  members  in  a cluster 
by  an  energy  model.  Section  III  proposes  a clustering 
algorithm  based  on  data  similarity  and  Section  IV  proposes 
the  data  aggregation  algorithm  based  on  PCA.  Section 
V evaluates  the  performance  of  the  proposed  algorithm. 
Finally,  Section  VI  concludes  this  paper. 

II.  System  Energy  Model 

A large number of sensors in a wireless sensor network produce a large
amount of data, and these sensed data are gathered into big data [15].
For the sink node to obtain all of the information with the least
energy, we first have to understand the distribution of the big data.
There are two ways to obtain this distribution. In the first, each
sensor node processes and extracts the related data and then transmits
the processed data to the sink node. In the second, all nodes transmit
their raw data to the sink node, which processes these data and makes
the best choice for the network. In this paper, we adopt the second
approach to find neighbor nodes whose data are highly similar and to
group their data for PCA-based compression.


We consider a sensor network consisting of $N$ sensor nodes that are
randomly distributed in an $M \times M$ sensing area. We assume that
each node has $l$ bytes of data to send to the sink node, and we adopt
a simple radio model for the energy consumption of transmission. The
energy consumed by a node in transmitting $l$ bytes is given by [14]

$$
E_T(l, d) =
\begin{cases}
l E_{elec} + l E_{fs} d^2, & \text{if } d < d_0, \\
l E_{elec} + l E_{amp} d^4, & \text{otherwise},
\end{cases}
\tag{1}
$$

where $d$ is the distance between transmitter and receiver and $d_0$
represents the communication radius of the node. $E_{elec}$ is the
energy consumed per byte by the circuitry of the transmitter and the
receiver, and $E_{fs} d^2$ is the energy consumed by the amplifier
within the communication range. Within its communication range, a node
operates at a fixed power level. When the communication distance
exceeds the communication range of the sensor node, the node needs to
increase its transmission power; $E_{amp} d^4$ denotes the energy
consumed by the amplifier beyond the communication range.

Moreover, we divide the network into clusters based on the data
similarity of nodes, and each non-cluster-head node transmits its data
to the cluster head. Let $E_{pr}$ be the energy consumed by the cluster
head in aggregating one byte of data from its cluster members. Then we
have

$$
E_P = k \, l \, E_{pr}, \tag{2}
$$

where $E_P$ represents the energy consumption of data aggregation in a
cluster head and $k$ is the number of cluster members. Clearly, $E_P$
is proportional to $l$. For data reception, the energy consumed by a
node in receiving $l$ bytes of data is given by

$$
E_R(l, d) = l E_{elec}. \tag{3}
$$
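As an illustration, Eqs. (1)-(3) can be evaluated with a few lines of
Python. This is a minimal sketch rather than the simulation code behind
the paper: the constants are the values listed in Table I (given there
per bit, so the unit of `l` is an assumption), the default `d0 = 20`
follows the 20 m communication radius used in Section V, and `e_pr` has
no value in the text and must be supplied by the caller.

```python
# Constants from Table I (per bit); treating them as "per unit of l" is an assumption.
E_ELEC = 50e-9    # J, circuit energy per unit of data
E_FS = 100e-12    # J per m^2, amplifier energy within the communication range
E_AMP = 1e-12     # J per m^4, amplifier energy beyond the communication range
D0 = 20.0         # m, communication radius (value used in Section V)


def tx_energy(l, d, d0=D0):
    """Energy to transmit l units of data over distance d, Eq. (1)."""
    if d < d0:
        return l * E_ELEC + l * E_FS * d ** 2
    return l * E_ELEC + l * E_AMP * d ** 4


def rx_energy(l):
    """Energy to receive l units of data, Eq. (3)."""
    return l * E_ELEC


def aggregation_energy(k, l, e_pr):
    """Energy spent by a cluster head aggregating data of k members, Eq. (2).

    e_pr is a caller-supplied value; the text does not specify it.
    """
    return k * l * e_pr


if __name__ == "__main__":
    print(tx_energy(1, 10), tx_energy(1, 30), rx_energy(8))
```

The two branches of `tx_energy` make the cost of leaving the
communication range visible: beyond `d0` the amplifier term grows with
the fourth power of the distance.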

The energy consumption of each cluster comprises data transmission,
data reception, and data processing. Therefore, we can formulate the
energy consumption in a cluster as

$$
\begin{aligned}
E_{cluster} &= E_T(l, d) + E_R(l, d) + E_P \\
&= k l E_{elec} + k l E_{fs} d_{toCH}^2
 + \beta l E_{elec} + \beta l E_{amp} d_{toSINK}^4
 + k l E_{elec} + k l E_{pr} \\
&= 2 k l E_{elec} + k l E_{fs} d_{toCH}^2
 + \beta l E_{elec} + \beta l E_{amp} d_{toSINK}^4
 + k l E_{pr},
\end{aligned}
\tag{4}
$$

where $d_{toCH}$ and $d_{toSINK}$ denote the distance from a node to
the cluster head and to the sink node, respectively. We assume perfect
data aggregation, and $\beta l$ denotes the size of the data after
aggregation. According to [16], the expected value of the squared
distance $d_{toCH}^2$ is given by

$$
E[d_{toCH}^2] = \frac{M^2}{2\pi N / k} = \frac{k M^2}{2\pi N}. \tag{5}
$$
Moreover, the total energy consumption for one round of data collection
is given by

$$
\begin{aligned}
E_{total} &= \frac{N}{k} E_{cluster} \\
&= 2 N l E_{elec} + N l E_{fs} \frac{k M^2}{2\pi N}
 + \frac{N}{k} \beta l E_{elec}
 + \frac{N}{k} \beta l E_{amp} d_{toSINK}^4
 + N l E_{pr}.
\end{aligned}
\tag{6}
$$
Then we can obtain the optimal number of members in a cluster by taking
the first-order derivative of $E_{total}$ with respect to $k$ and
setting it to zero, i.e.,

$$
k_{opt} = \sqrt{\frac{2\pi \left( N \beta E_{elec}
 + N \beta E_{amp} d_{toSINK}^4 \right)}{E_{fs} M^2}}. \tag{7}
$$
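The differentiation step can be written out explicitly. The following
is a sketch that assumes $\beta$, $d_{toSINK}$, $l$ and the energy
constants do not depend on $k$, so that only the amplifier term towards
the cluster head and the two terms scaled by $N/k$ contribute:

```latex
\begin{aligned}
\frac{\partial E_{total}}{\partial k}
  &= \frac{l E_{fs} M^2}{2\pi}
   - \frac{N \beta l \left( E_{elec} + E_{amp} d_{toSINK}^4 \right)}{k^2} = 0 \\
\Rightarrow\quad k^2
  &= \frac{2\pi N \beta \left( E_{elec} + E_{amp} d_{toSINK}^4 \right)}{E_{fs} M^2}
\end{aligned}
```

Because the second derivative
$2 N \beta l (E_{elec} + E_{amp} d_{toSINK}^4) / k^3$ is positive for
$k > 0$, this stationary point is a minimum of the total energy.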


The optimal number of members in a cluster thus depends on the total
number of sensor nodes in the network, the distance between the sensor
nodes and the sink node, and the size of the sensing region. $k_{opt}$
is the theoretically optimal value, and we validate it by simulation in
Section V.
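To make the dependence on the network parameters concrete, the
closed-form value can be checked numerically against the total energy.
The sketch below is only illustrative: `N`, `M` and the energy
constants come from Table I, while `beta = 0.3`, `d_to_sink = 30`,
`l = 1` and `e_pr = 5e-9` are hypothetical values chosen for the
example and are not taken from the paper.

```python
import math


def total_energy(k, N, M, l, beta, d_to_sink, e_elec, e_fs, e_amp, e_pr):
    """Total energy per round with k members per cluster, following Eq. (6)."""
    d_to_ch_sq = k * M ** 2 / (2 * math.pi * N)   # expected squared distance, Eq. (5)
    return (2 * N * l * e_elec
            + N * l * e_fs * d_to_ch_sq
            + (N / k) * beta * l * e_elec
            + (N / k) * beta * l * e_amp * d_to_sink ** 4
            + N * l * e_pr)


def optimal_cluster_size(N, M, beta, d_to_sink, e_elec, e_fs, e_amp):
    """Closed-form minimizer of total_energy with respect to k, following Eq. (7)."""
    numerator = 2 * math.pi * N * beta * (e_elec + e_amp * d_to_sink ** 4)
    return math.sqrt(numerator / (e_fs * M ** 2))


if __name__ == "__main__":
    # beta and d_to_sink are hypothetical example values.
    k_opt = optimal_cluster_size(N=80, M=100, beta=0.3, d_to_sink=30,
                                 e_elec=50e-9, e_fs=100e-12, e_amp=1e-12)
    print(round(k_opt, 2))
```

Evaluating `total_energy` slightly below and slightly above the
returned value is a quick way to confirm that it is a minimum for the
chosen parameters.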


III. Clustering Algorithm with Data Similarity

A. Similarity Measurement

To ensure accuracy, real-time data transmission is often needed. Due to
high spatiotemporal correlation, the data sensed by different sensor
nodes are similar. Thus, we propose a clustering algorithm based on
data similarity for data aggregation. In the following, we define data
similarity from two aspects: data magnitude similarity and data
correlation.

1) Data magnitude similarity: Let $D$ denote the difference of the
current data of the nodes, and let $c$ denote a measurement threshold.
We can set an appropriate value of $c$ according to the actual
requirements, i.e.,

$$
c = D \times kk, \tag{8}
$$

where $kk$ denotes the similarity coefficient, which reflects how
strongly the difference of data influences the similarity. The value of
$kk$ can be changed as needed. According to the characteristics of our
collected data, we let $kk = 0.3$. If the difference between the
measured values of the sensed data from two neighboring nodes is less
than $c$, their data meet the magnitude similarity. Within one hop of
data transmission, the difference between any two node measurements in
the same cluster is then less than $2c$: each member differs from the
cluster head by less than $c$, so by the triangle inequality two
members differ by less than $2c$.

2) Data correlation: Let $X_n = \{x_{n1}, x_{n2}, \ldots, x_{nt}\}$
represent the observations of node $n$, where $1, 2, \ldots, t$ are the
sampling time slots. Assume that nodes 1 and 2 are neighbors and that
their observations are $X_1 = \{x_{11}, x_{12}, \ldots, x_{1t}\}$ and
$X_2 = \{x_{21}, x_{22}, \ldots, x_{2t}\}$, respectively. Data
correlation between nodes 1 and 2 can be defined as

$$
\mathrm{corr}(X_1, X_2)
$$

We assume that $X_1$ and $X_2$ are similar if they satisfy

$$
\mathrm{corr}(X_1, X_2) < \varepsilon. \tag{10}
$$

If a pair of sensor nodes satisfies both the data magnitude similarity
and the data correlation criteria, i.e., the similarity measures are
below the given thresholds $c$ and $\varepsilon$, we say that the nodes
are similar.
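The two similarity criteria of this subsection can be combined in a
small helper. This is a hedged sketch: the correlation measure is
passed in as the callable `corr`, because its closed form is not fixed
here, and `mean_abs_gap` is a purely hypothetical placeholder used only
to make the example executable. Taking the last sample of each series
as the "current" reading is likewise an assumption.

```python
def magnitude_threshold(D, kk=0.3):
    """Threshold c = D * kk, Eq. (8); kk = 0.3 is the value used in the text."""
    return D * kk


def is_similar(x1, x2, c, eps, corr):
    """Return True if two observation series meet both similarity criteria.

    x1, x2 : equally long sequences of samples; the last entry is treated
             as the current reading (assumption of this sketch).
    corr   : callable implementing the correlation measure; smaller values
             mean "more similar", matching the test corr(X1, X2) < eps.
    """
    magnitude_ok = abs(x1[-1] - x2[-1]) < c
    correlation_ok = corr(x1, x2) < eps      # Eq. (10)
    return magnitude_ok and correlation_ok


def mean_abs_gap(a, b):
    """Hypothetical stand-in for corr(): mean absolute sample difference."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


if __name__ == "__main__":
    c = magnitude_threshold(D=100)
    print(is_similar([10, 11, 12], [11, 12, 13], c=c, eps=2, corr=mean_abs_gap))
```

Keeping the correlation measure as a parameter lets the same check be
reused with whichever definition of `corr` a deployment adopts.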


B.  Clustering-compression  algorithm 

In this subsection, given the similarity values, we propose a heuristic
clustering algorithm suited to the PCA-based aggregation algorithm. It
partitions the sensor nodes with high similarity into clusters and
selects an appropriate sensor as the cluster head (CH) of each cluster.

Our heuristic clustering algorithm works as follows. Each sensor node
senses environmental data at a fixed interval, and the sensed data
constitute the observation matrix $X$. First, each node calculates the
similarity with its neighbor nodes. If node $u$ and node $v$ satisfy
the given similarity threshold $\varepsilon$, they set up an edge $uv$.
In this way, all sensor nodes form a graph $G$. Then, we sort the nodes
by their degrees, and the node with the largest degree is selected as
the cluster head. To minimize energy consumption in each cluster, we
constrain the number of members in each cluster to $k - 1$, where $k$
is equal to $k_{opt}$ in Eq. (7). The cluster head then selects the
$k - 1$ nodes with the highest similarity from its neighbor nodes, and
the selected nodes are removed from the set $S$ of all nodes. The
clustering procedure repeats until the largest degree of the nodes in
$S$ is less than $k - 1$. In general, some nodes remain in $S$. In this
situation, we do not recompute the similarity ($c$ and $\varepsilon$)
but reduce the number of cluster members so that the remaining nodes
can also form clusters. Once the clusters become stable, the cluster
members rotate the cluster head role. When the sink node finds that the
inter-cluster data differ by more than a given threshold several times,
or that half of the nodes no longer satisfy the similarity values, it
triggers the clustering algorithm again.
Algorithm 1 Similarity Clustering Algorithm

Input: N nodes in set S
Output: Clusters with k members and clusters with fewer than k members
 1: Each node records the number of its neighbor nodes;
 2: while k > 1 and S ≠ ∅ do
 3:   for j = 1 → n do
 4:     Node j in S measures the similarity based on our proposed
        criteria, i.e., the difference of the measurement with its
        neighbor nodes is less than c, and corr(X_j, X_{neighbor of j})
        is less than ε;
 5:     Connect node j with its neighbor nodes which satisfy the given
        similarity thresholds c and ε;
 6:     Calculate the degree of node j;
 7:   end for
 8:   repeat
 9:     Rank the nodes in S according to their degrees;
10:     Select the node with the largest degree as the CH node;
11:     Successively select the k − 1 neighbor nodes with the larger
        similarity values as cluster members;
12:     Remove the CH node and its cluster members from S;
13:     Update the degree of each remaining node in S;
14:   until the largest degree falls below k − 1
15:   k ← k − 1;
16:   Let n equal the number of elements in the current set S;
17: end while


Fig.  1.  Similarity  measurement. 


Fig.  2.  Clustering  with  d = D * 0.3. 


The  pseudo-code  procedure  of  the  algorithm  is  given  in 
Algorithm  1. 
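The greedy procedure of Algorithm 1 can be sketched in Python as
follows. This is an illustrative, centralized rendering under stated
assumptions and not the authors' distributed implementation: the input
`neighbors` is assumed to already contain only the pairs that passed
the thresholds, the scores are assumed to be "larger means more
similar" (with a dissimilarity measure the sort order would flip), ties
between equal degrees are broken by node id, and nodes left without any
similar neighbor are returned as single-node clusters.

```python
def similarity_clustering(neighbors, k):
    """Greedy similarity clustering in the spirit of Algorithm 1.

    neighbors : dict mapping node id -> {neighbor id: similarity score},
                symmetric, holding only pairs that passed the thresholds.
    k         : target cluster size (one CH plus k - 1 members).
    Returns a list of (cluster_head, [members]) tuples.
    """
    remaining = set(neighbors)
    clusters = []
    while k > 1 and remaining:
        while True:
            # Degree counts only neighbors that are still unassigned.
            degree = {u: sum(1 for v in neighbors[u] if v in remaining)
                      for u in remaining}
            head = max(sorted(degree), key=degree.get)  # ties: smallest id
            if degree[head] < k - 1:
                break                                   # "until" condition
            candidates = [v for v in neighbors[head] if v in remaining]
            candidates.sort(key=lambda v: neighbors[head][v], reverse=True)
            members = candidates[:k - 1]
            clusters.append((head, members))
            remaining -= {head, *members}
            if not remaining:
                break
        k -= 1                                          # relax the cluster size
    # Nodes with no similar neighbor left form single-node clusters.
    clusters.extend((u, []) for u in sorted(remaining))
    return clusters


if __name__ == "__main__":
    graph = {
        1: {2: 0.9, 3: 0.8},
        2: {1: 0.9, 3: 0.7},
        3: {1: 0.8, 2: 0.7, 4: 0.6},
        4: {3: 0.6, 5: 0.95},
        5: {4: 0.95},
        6: {},
    }
    print(similarity_clustering(graph, k=3))
```

Recomputing the degrees after every removal mirrors step 13 of the
listing; for large networks an incremental update would be cheaper, but
the simple form keeps the correspondence with the pseudo-code easy to
follow.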

Fig. 1 shows the results of the similarity measurement, while Fig. 2
shows the results of the proposed clustering algorithm (Algorithm 1),
where we set $c = 20$ and $\varepsilon = 0.8$. In the figures, the
number inside each circle is the node number, and the number outside
the circle is the current measured data; the source of the data is
described in Section V. We can observe from Fig. 2 that, after
clustering, adjacent nodes in the same cluster have similar
measurements.


IV.  Data  Aggregation  based  on  Principal 
Component  Analysis 

In  this  section,  we  propose  a data  aggregation  algorithm 
based  on  principal  component  analysis. 

Principal component analysis (PCA) is a useful compression algorithm
and is well suited to data aggregation in sensor nodes with limited
computation capacity in WSNs [17]. PCA transforms the sensed data into
a new coordinate system in which the eigenvector of the largest
eigenvalue becomes the first coordinate (called the first principal
component), the eigenvector of the second largest eigenvalue becomes
the second coordinate (the second principal component), and so on [18].
Therefore, PCA can reduce the dimension of data sets while keeping the
characteristics that contribute most to the variance.

In the following, we introduce the data compression method based on
PCA. A wireless sensor network can be divided into many clusters by our
proposed clustering method. Each cluster head (CH) collects the
measurement data of its members and puts them into the observation
matrix $x$. The observation matrix $x$ can be transformed into a new
space by

$$
y = P x, \tag{11}
$$

where $P$ denotes an $m \times n$ orthogonal transformation matrix, $x$
is an $n \times n$ matrix, and $m$ is much smaller than $n$. We thus
obtain a low-dimensional projection matrix $y$. Given a suitable matrix
$P$, $x$ can be reconstructed as

$$
\hat{x} = P^T y. \tag{12}
$$


Since the cluster head node sends only $y$ to its destination, instead
of the high-dimensional $x$, the amount of transmitted data can be
reduced remarkably, which further reduces the energy consumption of
data transmission [19]. Because the dimension of $y$ is reduced,
$\hat{x}$ is only an approximation of $x$, and we consider the
normalized reconstruction error defined as


$$
\gamma = \frac{\| x - \hat{x} \|^2}{\| x \|^2}, \tag{13}
$$

where $x$ and $\hat{x}$ represent the original and recovered data,
respectively.
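The projection, reconstruction and error computation described above
can be sketched with NumPy. The block is an illustration under several
assumptions that are not stated in the text: the observation matrix
holds one row per node and one column per sample, the per-node mean is
removed before projecting and sent along with `y`, the eigenvectors are
stored as the rows of `P` so that `y = P x` has `m` rows, and the error
is taken as a ratio of squared norms. The input data in the usage
example are synthetic.

```python
import numpy as np


def pca_aggregate(x, m):
    """Compress x (n nodes by t samples) to its m leading principal components."""
    mean = x.mean(axis=1, keepdims=True)        # per-node mean, an estimate of E[x]
    centered = x - mean
    cov = centered @ centered.T / x.shape[1]    # n x n sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:m]         # indices of the m largest eigenvalues
    P = eigvecs[:, top].T                       # m x n, one eigenvector per row
    y = P @ centered                            # projection, cf. Eq. (11)
    return P, y, mean


def pca_reconstruct(P, y, mean):
    """Approximate recovery of x from the projection, cf. Eq. (12), plus the mean."""
    return P.T @ y + mean


def normalized_error(x, x_hat):
    """Squared-norm ratio; the exact norm convention is an assumption of this sketch."""
    return float(np.linalg.norm(x - x_hat) ** 2 / np.linalg.norm(x) ** 2)


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 50)
    # Synthetic readings: four nodes observing scaled copies of one signal.
    x = np.vstack([10 + s * np.sin(2 * np.pi * t) for s in (1.0, 1.1, 0.9, 1.05)])
    P, y, mean = pca_aggregate(x, m=1)
    print(y.shape, normalized_error(x, pca_reconstruct(P, y, mean)))
```

In this example the four rows are strongly similar, so a single
component is enough to reconstruct them almost exactly, which is the
effect the similarity-based clustering is meant to exploit.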
It is clear that the reconstruction error $\gamma$ is determined by the
transformation matrix $P$. The matrix $P$ is obtained from the
covariance matrix $C$, and each row vector of $P$ is an eigenvector of
$C$ [20]:

$$
C = E\left[ (x - E[x]) (x - E[x])^T \right]. \tag{14}
$$

Let $\lambda$ denote the vector of eigenvalues of the covariance matrix
$C$; the eigenvalues are nonnegative real numbers, since any covariance
matrix is nonnegative definite [21]. Let $q$ denote the number of
non-zero eigenvalues of $C$ ($q > m$). If all $q$ eigenvectors of $C$
are used as the rows of $P$, the normalized reconstruction error is
minimized, the total variance of the data in all directions is
preserved, and $y$ represents all the principal components of $x$.
Using only $m < q$ eigenvectors reduces the amount of data but discards
part of the variance, so to ensure the accuracy of data reconstruction
we must bound how far the projection space is reduced. The CH uses the
following formula to measure the accuracy of data reconstruction:

$$
T(m) = \frac{\sum_{k=1}^{m} \lambda_k}{\sum_{k=1}^{q} \lambda_k}, \tag{15}
$$

where $\lambda_k$ denotes the $k$-th largest eigenvalue. The
pseudo-code of the data aggregation algorithm is given in Algorithm 2.
We let $d_{value}$ denote the difference between the maximum and
minimum elements of the observation matrix $X$,

$$
d_{value} = | X_{max} - X_{min} |, \tag{16}
$$

where $X_{max}$ and $X_{min}$ denote the maximum and minimum elements
in $X$, respectively. We let $p$ denote the aggregation ratio, i.e.,
the ratio of the size of the data after aggregation to that before
aggregation. Fig. 3 depicts the relationship between the normalized
reconstruction error and the aggregation ratio for different
$d_{value}$. Fig. 3 shows that (i) the normalized reconstruction error
decreases as the aggregation ratio increases, and (ii) the smaller the
value of $d_{value}$ is, the smaller the aggregation ratio is for a
given reconstruction error. When the error rate is 5%, the aggregation
ratio is 22% for $d_{value} = 50$, whereas it is 40% for
$d_{value} = 100$ and 44% for $d_{value} = 150$. In this case, the
smaller data range saves around half of the storage space. Thus the
data similarity has a great influence on the amount of compressed data.
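The accuracy measure $T(m)$ suggests a simple rule for choosing the
number of retained components. The sketch below reads $T(m)$ as the
share of the total variance carried by the $m$ largest eigenvalues,
which is an assumption about the exact form of Eq. (15); the target
value passed to `smallest_m` is a free parameter, and the eigenvalues
in the example are made up.

```python
def retained_variance(eigenvalues, m):
    """T(m): share of the total variance kept by the m largest eigenvalues."""
    ranked = sorted(eigenvalues, reverse=True)
    total = sum(ranked)
    if total <= 0:
        return 1.0          # degenerate case: no variance to lose
    return sum(ranked[:m]) / total


def smallest_m(eigenvalues, target):
    """Smallest m whose retained variance reaches `target` (e.g. 0.95)."""
    for m in range(1, len(eigenvalues) + 1):
        if retained_variance(eigenvalues, m) >= target:
            return m
    return len(eigenvalues)


if __name__ == "__main__":
    eigs = [6.0, 3.0, 1.0]  # made-up eigenvalues for illustration
    m = smallest_m(eigs, target=0.85)
    print(m, m / len(eigs))  # components kept and the resulting ratio m / n
```

The ratio `m / n` printed at the end corresponds to the aggregation
ratio when every component has the same size, so more similar data
(variance concentrated in few eigenvalues) yields a smaller ratio for
the same target.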


V.  Performance  Evaluation 

In  this  section,  we  present  numerical  results  to  verify 
the  performance  of  our  proposed  clustering  and  PCA  based 
data  aggregation  algorithm.  In  addition,  we  will  show  that 
our  method  can  reduce  the  amount  of  data  transmission 
while  ensuring  a certain  accuracy. 

Fig. 3. Features of Principal Component Analysis.

Algorithm 2 Data Aggregation Algorithm

Input: Measurement data of clusters
Output: Aggregated data matrix
1: Cluster head puts measurement data of its members into observation
   matrix x;
2: Calculate the covariance matrix C = E[(x − E[x])(x − E[x])^T] of x;
3: Calculate the eigenvalues and the corresponding eigenvectors of
   matrix C;
4: Rank the eigenvalues in descending order;
5: Select the m eigenvectors corresponding to the m largest eigenvalues
   to form the transformation matrix P;
6: Compute the projection matrix y = Px;
7: Send the projection matrix y to the sink node.

We randomly deploy a WSN with $N = 80$ sensors in a 100 m × 100 m
region to sample illumination intensity. The communication range of
each sensor is a circular area with a radius of 20 meters. The value of
the illumination intensity is produced by a superposition of Gaussian
distributions [22]

$$
x(i, j, t) = \sum_{k} \left( \ldots \, \cos(2\pi f_k t) \right), \tag{17}
$$

where $(i, j)$ denotes a position in the sensing area and $x(i, j, t)$
is the illumination intensity at position $(i, j)$ at time $t$.

In our simulation, we use two Gaussian components with frequencies
$f_1 = 0.1$ Hz and $f_2 = 0.2$ Hz, data means $\mu_1 = [3, 2]$ and
$\mu_2 = [3, 2]$, covariance matrices $\Sigma_1 = [10, 2; 2, 9]$ and
$\Sigma_2 = [8, 2; 2, 10]$, and $\delta_1 = 10$, $\delta_2 = 15$. Such
data can summarize the data distribution of a variety of situations
well. Fig. 4 depicts the current value of the data in each node. Other
parameter settings are listed in Table I.

A.  Evaluation  of  proposed  clustering  algorithm 

According to the parameter settings in Table I, we expect the optimal
value to satisfy $3 < k_{opt} < 17$ for the 80-node network, where we
set $c = 50$ and $\varepsilon = 0.8$. Thus we vary the number of
cluster members between 5 and 14.

Fig. 5 shows the average energy consumption per round with more than 95
percent accuracy. The experimental results and the theoretical analysis
agree closely: the optimal number of clusters found by experiment is
about 10-15 for the 80-node network, and the optimal number of clusters
given by theoretical analysis is 11.
TABLE I
Variable parameters

Parameter    Value             Parameter    Value
E_elec       50 nJ/bit         N            80
E_amp        1 pJ/bit/m^4      M            100
d_toSINK     10~80 m           E_fs         100 pJ/bit/m^2

Fig.  4.  Current  data  of  nodes. 

When there is only one cluster in the network, the non-cluster-head
nodes waste a lot of energy on data transmission. When there are too
many clusters in the network, there is not much local data for data
compression, which degrades the performance of the compression
algorithm.

Next, we employ the similarity criteria given in Section III-A to
compute the similarity among nodes. Fig. 6 depicts the network topology
obtained from the similarity among sensors, in which a line between two
nodes indicates that their data are similar. We can observe that nodes
connect effectively with the neighbor nodes whose data are highly
similar, which means that our proposed similarity criteria can
effectively keep dissimilar data out of a cluster. On the basis of the
constructed clusters, our PCA-based data aggregation algorithm on the
cluster head nodes can handle the similar data well.

Fig. 7 illustrates the results after clustering. Sensor nodes with
higher similarity are organized effectively into clusters by our
method, the range of the data is well controlled in each cluster, and
the data of the nodes within one cluster are relatively close. In this
case, our clustering algorithm can effectively enhance the performance
of the PCA-based data aggregation algorithm, reducing both the amount
of transmitted data and the difficulty of data processing. The amount
of aggregated data is much smaller than the amount of raw data
generated by the sensors.

Another observation is that we can change the thresholds $c$ and
$\varepsilon$ and the normalized reconstruction error according to
actual requirements. When we relax the thresholds, each node has more
opportunities to connect with neighbor nodes. This helps the network
reduce computational cost, since it leads to fewer clusters in the
network.

B.  Comparison  of  data  aggregation  algorithms 

Fig. 5. The average energy consumption with N = 80 sensors in a
100 m × 100 m region.

Fig. 6. Network topology with 80 sensors distributed in a 100 × 100
area.

In this subsection, we compare the energy consumption of the LEACH
algorithm [16], the K-means with principal component analysis (PCA)
algorithm, and our PCA-based data aggregation algorithm. We choose
LEACH because our clustering algorithm is influenced by it. The K-means
with PCA algorithm computes the K-means of the observation data when
constructing the transformation matrix $P$. In our simulation, each
node begins with 2 J of energy, and the communication capability of
each node is adjustable. Fig. 8 shows the relationship between energy
consumption and the number of nodes. When the number of nodes is less
than 300, the energy consumption of our algorithm is similar to that of
LEACH and K-means PCA. As the number of nodes increases, the amount of
data increases as well, and our proposed algorithm significantly
reduces the energy consumption compared with the other two methods.
This is because (i) our similarity clustering approach lays a good
foundation for data compression, and (ii) by taking advantage of the
similarity of data in network clustering, we can effectively reduce the
network energy consumption for a given data accuracy.

Fig. 7. Cluster structure by proposed clustering algorithm with d = 50.

VI. Conclusions

In this paper, based on the similarity of data among adjacent nodes, we
propose a distributed clustering algorithm that effectively organizes
nodes with high similarity into a cluster for data aggregation while
ensuring uniform energy consumption within the cluster. Moreover, we
propose a data aggregation algorithm based on principal component
analysis (PCA) that can be executed in the cluster head. In particular,
the proposed data aggregation algorithm is more effective when it is
combined with the clustering algorithm. Finally, our experimental
results illustrate that the proposed algorithm can effectively reduce
the amount of data transmission and energy consumption in the network.